# A text preprocessing pipeline

The purpose of this pipeline is to turn raw text into uniform text data that is ready for analysis.

To work toward this goal we need some text data. We use the novel Frankenstein, which is available from Project Gutenberg (gutenberg.org).

In [1]:
# Import libraries
import urllib.request

# Get the 1818 edition of Frankenstein
url = 'https://gutenberg.org/cache/epub/41445/pg41445.txt'
raw_text = urllib.request.urlopen(url).read().decode()
# Keep only the novel itself: from the preface up to the Project Gutenberg end marker
text_start = raw_text.find('PREFACE.')
text_end = raw_text.find('*** END OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN; OR, THE MODERN PROMETHEUS ***')
text = raw_text[text_start:text_end].strip() # Slice
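
One caveat with the slice above: `str.find` returns -1 when a marker is missing, so a changed header or footer would silently produce the wrong slice. A minimal sketch of a stricter variant follows. The helper name `slice_between` and the sample string are hypothetical and are not part of the original notebook.

```python
def slice_between(raw, start_marker, end_marker):
    """Return the text from start_marker up to (not including) end_marker.

    Hypothetical helper: raises ValueError instead of slicing with -1.
    """
    start = raw.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    # Search for the end marker only after the start marker.
    end = raw.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return raw[start:end].strip()


# Made-up sample standing in for the downloaded file.
sample = "header\nPREFACE.\nSome body text.\n*** END ***\nfooter"
body = slice_between(sample, "PREFACE.", "*** END ***")
print(body)
```

With the real download, the same call would presumably take `raw_text` and the two markers used in the cell above.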

In [2]:
# Identify noise in the text
text[:3000]

'PREFACE.\r\n\r\n\r\nThe event on which this fiction is founded has been supposed, by Dr.\r\nDarwin, and some of the physiological writers of Germany, as not of\r\nimpossible occurrence. I shall not be supposed as according the remotest\r\ndegree of serious faith to such an imagination; yet, in assuming it as\r\nthe basis of a work of fancy, I have not considered myself as merely\r\nweaving a series of supernatural terrors. The event on which the\r\ninterest of the story depends is exempt from the disadvantages of a mere\r\ntale of spectres or enchantment. It was recommended by the novelty of\r\nthe situations which it developes; and, however impossible as a physical\r\nfact, affords a point of view to the imagination for the delineating of\r\nhuman passions more comprehensive and commanding than any which the\r\nordinary relations of existing events can yield.\r\n\r\nI have thus endeavoured to preserve the truth of the elementary\r\nprinciples of human nature, while I have not scruple
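
Reading the raw string by eye works for a first look, but counting the characters that are neither letters, digits, nor spaces gives a quicker overview of the noise. The sketch below is one possible way to do that. The function name `non_alnum_counts` and the short sample string are assumptions made for illustration.

```python
from collections import Counter


def non_alnum_counts(text):
    """Count every character that is neither alphanumeric nor a plain space."""
    return Counter(ch for ch in text if not ch.isalnum() and ch != " ")


# Short made-up sample in the style of the output above.
sample = "Dr.\r\nDarwin, and some of the writers;\r\n"
for char, count in non_alnum_counts(sample).most_common():
    # repr() makes invisible characters such as \r and \n visible.
    print(repr(char), count)
```

Applied to `text`, a count like this would show the line breaks (`\r\n`) and the punctuation marks that the cleaning step has to handle.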

## Cleaning text data

In [3]:
import re


def clean(text):
    """Return a lowercased copy of text without punctuation, numbers, or short words."""

    # Replace punctuation and special characters with a space.
    # In a regex, the pipe | separates alternatives and the backslash \ escapes a
    # character that would otherwise have a special meaning, so \? matches a
    # literal question mark. It is worth looking up what \ and | do in detail.
    text = re.sub(r'\.|,|:|;|!|\?|\(|\)|\||\+|\'|\"|‘|’|“|”|…|\-|_|–|—|\$|&|\*|>|<|\/|\[|\]', ' ', text)

    # Remove numbers and words that contain digits
    text = re.sub(r'\b\w*\d\w*\b', '', text)

    # Remove words of length 2 or less
    text = re.sub(r'\b\w{1,2}\b', '', text)

    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)

    # Convert to lowercase
    text = text.lower()

    return text
    

clean(text)[:3000]

'preface the event which this fiction founded has been supposed darwin and some the physiological writers germany not impossible occurrence shall not supposed according the remotest degree serious faith such imagination yet assuming the basis work fancy have not considered myself merely weaving series supernatural terrors the event which the interest the story depends exempt from the disadvantages mere tale spectres enchantment was recommended the novelty the situations which developes and however impossible physical fact affords point view the imagination for the delineating human passions more comprehensive and commanding than any which the ordinary relations existing events can yield have thus endeavoured preserve the truth the elementary principles human nature while have not scrupled innovate upon their combinations the iliad the tragic poetry greece shakespeare the tempest and midsummer night dream and most especially milton paradise lost conform this rule and the most humble nov
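
The comments in `clean` point out that the backslash and the pipe matter in the punctuation pattern. The short sketch below illustrates both on a made-up sample string (the string and variable names are assumptions, not part of the notebook). It also shows a character class, which is a common alternative to a long chain of single-character alternatives.

```python
import re

sample = "Is it so? Yes|no."

# \? and \. match a literal question mark and a literal period;
# the unescaped | between them means "either one".
escaped = re.sub(r"\?|\.", " ", sample)
print(repr(escaped))

# Inside a character class [...] most special characters are literal,
# so this pattern matches the same two characters.
char_class = re.sub(r"[?.]", " ", sample)
print(repr(char_class))

# An escaped pipe \| matches the pipe character itself.
pipe_removed = re.sub(r"\|", " ", sample)
print(repr(pipe_removed))

# Without the backslash, a period matches any character (except a newline).
unescaped_dot = re.sub(r".", " ", "a.b")
print(repr(unescaped_dot))
```

The last call shows why the escaping matters: an unescaped period would blank out the whole text instead of only the punctuation.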